Science and engineering abound with different types of networks. Examples include social networks such
as Facebook and Twitter, networks of genes and proteins in molecular biology, network models for economic
and market dynamics, neural networks in brain imaging, networks of disease transmission in epidemiology,
and information networks in law enforcement. In the real world, the structure of the underlying network is not
known, but instead one observes samples of the network behavior (e.g., packet counts in a computer network;
infections observed at given times during an epidemic; emails or text messages sent among a group of
people).  Since  the  network  data  are  complex,  noisy  and/or  high-dimensional,  it  is  challenging  to  infer  the 
network  structure.  Developing  methods  for  solving  this  network  inference  problem  will  have  a  broad  range 
of  applications.  Examples  include  inferring  brain  connectivity  and  disease  etiology  in  neuroimaging  studies, 
detecting  terrorist  cells  in  social  networks,  monitoring  intrusions  in  computer  networks,  and  understanding 
the  basis  of  gene-protein  interactions  in  systems  biology. 

Spectral clustering is a popular technique for identifying communities (or clusters) in large networks.
The stochastic blockmodel is a social network model with well-defined communities; each node is a member
of one community. In paper [21], we provide a rigorous statistical analysis of community detection
by assessing how well spectral clustering can estimate the clusters in the stochastic blockmodel. Our results
are  the  first  clustering  results  that  allow  the  number  of  clusters  in  the  model  to  grow  with  the  number  of 
nodes,  hence  the  name  high-dimensional.  In  paper  [8],  we  study  the  impact  of  regularization  on  spectral 
clustering  and  attempt  to  quantify  the  obtained  improvement.  We  study  in  paper  [7]  the  performance  of 
spectral  clustering  in  recovering  the  latent  labels  of  i.i.d.  samples  from  a  finite  mixture  of  nonparametric 
distributions.  We  provide  a  novel  and  useful  characterization  of  the  principal  eigenspace  of  the  population- 
level  normalized  Laplacian  operator  and  establish  a  certain  geometric  property  of  nonparametric  mixtures: 
embedded  samples  from  different  components  are  approximately  orthogonal  with  high  probability. 
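
The pipeline described above (sample a graph from a stochastic blockmodel, embed the nodes with the leading eigenvectors of a normalized adjacency matrix, then cluster the embedded rows) can be sketched as follows. This is a minimal illustration, not the procedure analyzed in [21], [8], or [7]: the function names, the block probabilities, the choice of adding the average degree `tau` to the node degrees as a regularizer, and the small k-means routine are all assumptions made for the example.

```python
from itertools import permutations

import numpy as np


def sample_sbm(labels, p_in, p_out, rng):
    """Sample a symmetric 0/1 adjacency matrix from a simple stochastic blockmodel."""
    same_block = labels[:, None] == labels[None, :]
    edge_prob = np.where(same_block, p_in, p_out)
    upper = np.triu(rng.random(edge_prob.shape) < edge_prob, k=1)  # no self-loops
    return (upper | upper.T).astype(float)


def spectral_embedding(adj, k, tau):
    """Leading k eigenvectors of D_tau^{-1/2} A D_tau^{-1/2}, with D_tau = D + tau * I (tau > 0)."""
    scale = 1.0 / np.sqrt(adj.sum(axis=1) + tau)
    lap = scale[:, None] * adj * scale[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)
    leading = np.argsort(-np.abs(eigvals))[:k]
    return eigvecs[:, leading]


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        new_centers = np.array([
            points[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign


def best_match_accuracy(est, truth, k):
    """Fraction of correctly labeled nodes under the best relabeling of the clusters."""
    return max(np.mean(np.array(perm)[est] == truth) for perm in permutations(range(k)))


rng = np.random.default_rng(0)
true_labels = np.repeat(np.arange(3), 60)           # 3 blocks of 60 nodes (assumed sizes)
adj = sample_sbm(true_labels, p_in=0.5, p_out=0.05, rng=rng)
avg_degree = adj.sum() / adj.shape[0]
est_labels = kmeans(spectral_embedding(adj, k=3, tau=avg_degree), k=3)
accuracy = best_match_accuracy(est_labels, true_labels, k=3)
print(f"clustering accuracy: {accuracy:.3f}")
```

The example uses a dense, well-separated graph so that recovery is easy; the regimes of interest in the papers (many clusters, sparse graphs) are harder and are not reproduced here.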

Symmetric and undirected relationships are common assumptions in the clustering literature. However,
the vast majority of relationships are asymmetric or directed. For example, in the gene regulatory network,
one gene drives the transcription of the other gene. In the power grid network, electricity flows from one
node to the other. In a communication network, one node initiates the conversation. In other examples, it
might be easier to observe the relationship without direction, but the direction remains of fundamental
importance.  For  example,  in  a  social  network,  a  business  searching  for  “trend  leaders”  wants  to  know  the 
direction  of  influence  in  relationships.  It  is  an  interesting  and  important  question  to  identify  the  clustering 
asymmetries  in  directed  graphs.  In  paper  [2],  we  propose  a  novel  spectral  co-clustering  algorithm  called 
DI-SIM  for  asymmetry  discovery  and  directional  clustering.  A  new  Stochastic  co-Block  model  is  introduced 
to  show  favorable  properties  of  DI-SIM.  To  accommodate  sparse  graphs  and  highly  heterogeneous  degrees 
within clusters, DI-SIM uses a regularized graph Laplacian and a projection procedure. We apply a node-wise
asymmetry score and DI-SIM to analyze the clustering asymmetries in the networks of Enron emails,
political blogs, and the chemical connectome. In each example, a subset of nodes have clustering asymmetries;
these nodes send edges to one cluster, but receive edges from another cluster. Such nodes yield insightful
information (e.g., communication bottlenecks) about directed networks, but are missed if the analysis ignores
edge direction.
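
To make the "regularized graph Laplacian and projection" step concrete, the sketch below builds a degree-regularized Laplacian for a directed graph, takes its leading left and right singular vectors, and projects each row onto the unit sphere. It is an illustration under stated assumptions rather than the DI-SIM implementation of [2]: the function name, the default regularizer `tau` (the average degree), and the toy graph with separate sending and receiving memberships are all hypothetical. Clustering the two embeddings (for example with k-means) would then give sending and receiving clusters; that step is omitted.

```python
import numpy as np


def directed_spectral_embeddings(adj, k, tau=None):
    """Row-normalized leading singular vectors of a regularized Laplacian of a directed graph.

    adj[i, j] = 1 means there is an edge from node i to node j.
    Returns (sending embedding, receiving embedding), each of shape (n, k).
    """
    adj = np.asarray(adj, dtype=float)
    out_deg, in_deg = adj.sum(axis=1), adj.sum(axis=0)
    if tau is None:
        tau = adj.sum() / adj.shape[0]  # average degree; an assumed default
    lap = adj / np.sqrt(out_deg + tau)[:, None] / np.sqrt(in_deg + tau)[None, :]
    u, _, vt = np.linalg.svd(lap, full_matrices=False)

    def project_rows(mat):
        # Projection step: scale every row to unit Euclidean norm.
        norms = np.linalg.norm(mat, axis=1, keepdims=True)
        return mat / np.maximum(norms, 1e-12)

    return project_rows(u[:, :k]), project_rows(vt[:k].T)


rng = np.random.default_rng(1)
n = 40
send_cluster = np.repeat([0, 1], n // 2)    # hypothetical sending memberships
recv_cluster = np.arange(n) % 2             # hypothetical receiving memberships
edge_prob = np.where(send_cluster[:, None] == recv_cluster[None, :], 0.6, 0.1)
adj = (rng.random((n, n)) < edge_prob).astype(float)
np.fill_diagonal(adj, 0.0)
send_emb, recv_emb = directed_spectral_embeddings(adj, k=2)
```

Keeping two embeddings, one from the left and one from the right singular vectors, is what allows a node to belong to different clusters as a sender and as a receiver.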


2  Systems  biology 

Yu has advanced the project with Dr. Frise et al. on systems biology. Gene-gene interaction is at the
heart of understanding regulatory pathways of organ formation and developmental disorders. Spatial gene
co-occurrence information has been shown to be extremely useful in suggesting possible gene-gene interactions.
The abundance of spatial gene expression data in recent years opens up an exciting new avenue for
reconstructing gene regulatory networks. However, due to the complexity of spatial gene expression and
the noisy nature of the data acquisition process, extracting meaningful information from these data remains
a challenge. In paper [1], we propose the StaNMF method, which combines a fast and scalable implementation
of Non-negative Matrix Factorization (NMF) with a new stability-based criterion. StaNMF learns from
spatial gene expression data a set of data-driven basis elements, called Principal Patterns (PPs). As an example,
using the spatial gene expression images of early stage embryonic Drosophila melanogaster, we demonstrate
that the 21 learned PPs correspond to 21 localized pre-organ regions. The PPs provide a concise yet biologically
interpretable representation, comparable to the well-established Drosophila fate map and serving as
an  alternative  to  human  annotations.  Based  on  the  PPs,  we  construct  spatially  local  correlation  networks 
for  all  patterned  transcription  factors  during  early  Drosophila  development.  With  a  two-tailed  2.5%  cut-off, 
the  constructed  networks  are  consistent  with  10  out  of  12  links  in  the  well-studied  gap-gene  network  with 
six  major  gap  genes.  The  very  promising  performance  of  PPs  with  the  Drosophila  data  suggests  StaNMF  as 
a  standard  decomposition  approach  to  examine  complex  and  noisy  gene  expression  data. 
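
The idea of pairing NMF with a stability criterion can be sketched as follows: factor the nonnegative data matrix X (columns are images) as D A for a given K, repeat from several random starts, and measure how much the learned dictionaries D disagree. This is a minimal sketch and not the StaNMF code of [1]: the multiplicative-update solver, the Amari-type dissimilarity based on cosine similarity, the function names, and the synthetic data are assumptions made for illustration, and the criterion in [1] may differ in detail.

```python
import numpy as np


def nmf(x, k, rng, n_iter=200, eps=1e-9):
    """Factor nonnegative x (pixels x images) as d @ a using multiplicative updates."""
    n_pixels, n_images = x.shape
    d = rng.random((n_pixels, k)) + eps      # dictionary: one column per pattern
    a = rng.random((k, n_images)) + eps      # nonnegative coefficients
    for _ in range(n_iter):
        a *= (d.T @ x) / (d.T @ d @ a + eps)
        d *= (x @ a.T) / (d @ (a @ a.T) + eps)
    return d, a


def dictionary_dissimilarity(d1, d2, eps=1e-12):
    """Amari-type dissimilarity in [0, 1] from cosine similarities between columns."""
    u1 = d1 / np.maximum(np.linalg.norm(d1, axis=0), eps)
    u2 = d2 / np.maximum(np.linalg.norm(d2, axis=0), eps)
    sim = u1.T @ u2
    k = sim.shape[0]
    return (2 * k - sim.max(axis=0).sum() - sim.max(axis=1).sum()) / (2 * k)


def instability(x, k, n_runs=5, seed=0):
    """Average pairwise dissimilarity between dictionaries from different random starts."""
    dicts = [nmf(x, k, np.random.default_rng(seed + run))[0] for run in range(n_runs)]
    pairs = [(i, j) for i in range(n_runs) for j in range(i + 1, n_runs)]
    return float(np.mean([dictionary_dissimilarity(dicts[i], dicts[j]) for i, j in pairs]))


# Synthetic "images": 3 non-overlapping patterns mixed with random nonnegative weights.
rng = np.random.default_rng(0)
true_d = np.kron(np.eye(3), np.ones((10, 1)))            # 30 pixels x 3 patterns
x = true_d @ rng.random((3, 40)) + 0.01 * rng.random((30, 40))
scores = {k: instability(x, k) for k in (2, 3, 4)}
print(scores)
```

A value of K with a low instability score is one for which repeated runs agree on the learned patterns; scanning a range of K and inspecting the scores is the kind of selection the stability-based criterion supports.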

Our local network analysis suggests five uncharacterized genes as possible new candidates for the gap
gene network. Dr. Frise's group and his collaborators have been working on CRISPR experiments to knock
out each of the five candidate genes as experimental verification. So far, we have learned that knocking out
one of the genes is lethal, i.e., the fruit fly dies after the knock-out of the gene. Further examination of the ftz stained
embryos indicates that the lack of the gene might lead to an elongated head and a wider gap between the
first and the second segmentation stripes. Together with three students of mine, we are currently performing
cell counting analysis in an effort to provide numerical evidence for the conclusions of our visual inspection.

Given the success of our approach for spatial gene expression analysis for early stage fruit fly embryos,
we are in the process of extending it to model later stage gene expression. Due to the formation of internal organs
of the embryo, registering the embryo onto a standard template for cross-individual comparison can be
challenging. We build an organ classification and registration model that modifies state-of-the-art computer
vision  algorithms  to  produce  mid-level  image  features  well  suited  to  bio-imaging  tasks.  By  combining  our 
classification  model  with  non-negative  matrix  factorization,  we  produce  parts-based  representations  of  spatial 
gene  expression  in  various  organ  systems.  Our  PPs  of  gene  expression  are  interpretable,  low  dimensional 
representations  of  the  data  that  serve  as  a  late  stage  analogue  to  the  Drosophila  fate  map. 

We have put considerable effort into automating our techniques for the benefit of the systems biology
community. To facilitate automatic imaging, we are contributing to the design of a method that detects the most in-focus
image as the microscope adjusts its focal plane. To scale up and speed up the computation for larger spatial
gene expression data sets, we are working with Professor Andy Yao's group at Tsinghua University to create
a general computation framework for biological data processing. Our system, called LSEMS, or Large
Scale Experiment Management System, combines GitLab for easy version control, MongoDB for storage of
highly unstructured biological data, and Spark for distributed computing. The LSEMS framework allows
biologists to submit a task from their personal laptops, triggering the remote machine to execute the task and
distribute it to a large computer cluster. The results are sent back to the user once the computation
is completed. Initial testing indicates that the system is much more efficient than the previous one, which used
a single machine. We are now building an intuitive and easy-to-use graphical user interface (GUI) to make the
system more accessible to general biologists.
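
As an illustration of selecting the most in-focus image from a focal stack, the sketch below scores each image by the variance of a discrete Laplacian and returns the index of the highest score. The text does not specify which focus measure the method uses, so this measure, the function names, and the synthetic checkerboard stack are assumptions chosen only because Laplacian variance is a common sharpness heuristic.

```python
import numpy as np


def focus_score(image):
    """Variance of a 4-neighbor discrete Laplacian; larger values indicate a sharper image."""
    img = np.asarray(image, dtype=float)
    lap = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
           - 4.0 * img[1:-1, 1:-1])
    return float(lap.var())


def box_blur(image, passes):
    """Repeated 3x3 mean filter, used here only to simulate defocus."""
    out = np.asarray(image, dtype=float)
    rows, cols = out.shape
    for _ in range(passes):
        padded = np.pad(out, 1, mode="edge")
        out = sum(padded[i:i + rows, j:j + cols] for i in range(3) for j in range(3)) / 9.0
    return out


def most_in_focus(stack):
    """Index of the image with the highest focus score in a focal stack."""
    return int(np.argmax([focus_score(img) for img in stack]))


# Synthetic focal stack: one checkerboard blurred by different amounts.
ii, jj = np.indices((32, 32))
sharp = ((ii // 4 + jj // 4) % 2).astype(float)
stack = [box_blur(sharp, passes) for passes in (3, 1, 0, 2, 4)]
best = most_in_focus(stack)
print("sharpest image index:", best)
```

In a microscope loop the same score would be evaluated on each frame as the focal plane moves, keeping the frame with the largest value.
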

Figure 1: Learning principal patterns (PP) by staNMF from spatial gene expression patterns. (A) Expression
patterns of two genes, nub and salm, in Drosophila embryos. (B) For a given number K, NMF factorizes
the nonnegative data matrix X, the columns of which are gene expression images, into the product of two
nonnegative matrices: dictionary D, which contains the K PP, and coefficient matrix A, which contains the
nonnegative coefficients of the images. (C) StaNMF identified K = 21 to be the optimal number of PP for
15 < K < 30. (D) The Drosophila fate map (center), surrounded by the 21 PP learned by staNMF. The PP
are arrayed according to the corresponding regions of the fate map.

Figure 2: Modeling and validation of the Drosophila gap gene network with spatially local correlation
networks (SLCN). (A) The SLCN for six gap genes. For each of the six gap-PP, shown is the sub-network
of the SLCN that contains the six gap genes. Links are numbered from 1 to 14. (B) The gap gene network
diagram depicting repressive interactions of six genes. Links are numbered from 1 to 11 and multiple
occurrences of the same gene are subscripted by numbers (e.g., hb1 and hb2). The directions of the interactions
are not indicated. (C) Expression patterns of the six gap genes and their linearly ordered PP representation.
For each gene, the regions depicted in blue are the gap-PP with sPP coefficient greater than or equal to 0.1. The
* symbol indicates a region of gene expression with no match in (B).


3  Neuroscience 

In computational neuroscience, it is important to accurately estimate the proportion of signal variance in the total
variance  of  neural  activity  measurements.  Paper  [14]  proposes  a  novel  method  to  estimate  the  explainable 
variance  in  functional  MRI  (fMRI)  brain  activity  measurements  when  there  are  strong  correlations  in  the 
noise.  Our  shuffle  estimator  is  nonparametric,  unbiased,  and  built  upon  the  random  effect  model  reflecting 
the  randomization  in  the  fMRI  data  collection  process.  Motivated  by  collaborative  research  in  neuroscience, 
papers  [20,  15]  answer  questions  under  the  Pearl  causal  inference  framework,  which  is  an  alternative  to  the 
Neyman-Rubin framework. In particular, [20] proves analytical results that raise a red flag on the commonly
made faithfulness assumption. [15] proposes efficient MCMC algorithms to search through Markov equivalence
classes of a causal graph. It was the first such algorithm that works fast enough for hundreds of nodes,
admittedly  under  a  sparsity  condition. 

Yu continues her collaborative work with the Gallant Lab on understanding the visual pathway of primates
by using sparse coding, invariant features and deep convolutional neural networks to build more accurate
models of motion perception in the visual cortex. Our prior work has shown that image representations based
on the principles of sparse coding and nonlinear spatial pooling are empirically successful in explaining neural
responses in area V4 of the macaque visual cortex. Such models are able to discriminate between categories
of image regions (such as foreground vs. background, texture vs. contour), while also generalizing across
random realizations of object categories, due to their invariance to local deformation. These techniques
have been adapted to modeling higher order visual areas such as area MT on two experimental datasets
provided by the Gallant Lab, in which the stimulus consisted of a series of short movie clips. The datasets include
electrophysiological data from macaque area MT, as well as full visual cortex fMRI recordings of human
subjects.

Deep convolutional neural networks are biologically inspired learning techniques based on neural networks.
Recently, they have been the state-of-the-art methods for large-scale image recognition tasks in computer
vision. We deploy deep convolutional neural network features as early invariant features for modeling
area V4. When combined with sparse linear modeling, our deep convolutional feature based
model outperforms the previous sparse coding based methods. Furthermore, our model not only has better
prediction performance, but also leads to better interpretation, which could provide neuroscientists with clues
about the receptive fields and orientation tuning preferences of individual neurons. We have also started
to adapt deep convolutional neural network techniques to model data collected from fMRI experiments.
The main difficulty of modeling human neuronal activity via fMRI experiments is extracting a variety of
meaningful features from video data that are general enough to model the dynamics of different parts of the
visual cortex, ranging from early sensory areas such as LGN, V1 and V2 to higher order areas
such as MT and IT. Our approach is to use a two-stream deep convolutional neural network, combining
spatial invariant features with temporal optical flow features. Our preliminary results show that the temporal
optical flow features are not very good features when used alone for prediction, but when combined with
spatial features, they substantially improve prediction compared to previous hand-crafted
Gabor feature based models.

3.1  Aerosol  Optical  Depth  (AOD)  retrieval 

Yu has been collaborating with environmental scientists at JPL and Emory University to retrieve the aerosol
index AOD from NASA MISR remote sensing images for air pollution monitoring and management.
Satellite-retrieved Aerosol Optical Depth (AOD) can potentially provide an effective way to complement the limited
spatial coverage of ground particulate air pollution monitoring networks such as the AErosol RObotic NETwork
(AERONET). Although MISR's aerosol products lead to exciting research opportunities to study particle
composition at a regional scale, their spatial resolution is too coarse for analyzing urban areas, where the air
pollution has stronger spatial variations and can severely impact public health and the environment. Using
NASA's novel multi-angle satellite sensor MISR, [16] develops a novel AOD retrieval algorithm with 4.4 km
x 4.4 km resolution using Bayesian models and MCMC. [6] uses AERONET DRAGON campaign data from
2011 in the Baltimore area to further validate our MCMC algorithm. We show that our MCMC algorithm
substantially improves over the MISR operational algorithm both in terms of coverage and
root-mean-square error (RMSE).


4  Statistical  guarantees  of  EM  algorithm 

The  EM  algorithm  is  a  widely  used  tool  in  maximum-likelihood  estimation  in  incomplete  data  problems. 
Existing  theoretical  work  has  focused  on  conditions  under  which  the  iterates  or  likelihood  values  converge, 
and  the  associated  rate  of  convergence.  Such  guarantees  do  not  distinguish  whether  the  ultimate  fixed  point 
is  a  global  or  local  optimum  of  the  sample  likelihood,  nor  its  relation  to  the  global  optima  of  the  idealized 
population likelihood (obtained in the limit of infinite data). In paper [3], we develop a theoretical framework
for  quantifying  when  and  how  quickly  EM-type  iterates  converge  to  a  small  neighborhood  of  a  given  global 
optimum  of  the  population  likelihood.  For  correctly  specified  models,  such  a  characterization  yields  rigorous 
guarantees  on  the  performance  of  two-stage  estimators  in  which  an  initial  pilot  estimator  is  refined  with 
iterations  of  the  EM  algorithm.  Our  analysis  is  divided  into  two  parts:  a  treatment  of  the  EM  and  gradient 
EM  algorithms  at  the  population  level,  followed  by  results  that  apply  to  these  algorithms  on  a  finite  set  of 
samples.  We  verify  our  conditions  and  give  tight  characterizations  of  the  region  of  convergence  for  three 
canonical  problems  of  interest:  mixture  of  Gaussians,  mixture  of  regressions,  and  linear  regression  with 
covariates  missing  completely  at  random. 
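
For intuition on the two-stage scheme (a pilot estimate refined by EM iterations), the sketch below runs EM on a balanced mixture of two Gaussians with means +theta and -theta and known variance, one simple instance of the mixture-of-Gaussians problem. The closed-form update, the sample size, the true value of 2.0, and the pilot value of 0.5 are assumptions made for the example; it does not reproduce the conditions or the regions of convergence established in [3].

```python
import math
import random


def em_step(theta, xs, sigma=1.0):
    """One EM update for the mixture 0.5*N(theta, sigma^2) + 0.5*N(-theta, sigma^2).

    E-step: the posterior probability w_i of the +theta component satisfies
    2*w_i - 1 = tanh(theta * x_i / sigma^2).
    M-step: theta_new = mean((2*w_i - 1) * x_i).
    """
    return sum(math.tanh(theta * x / sigma ** 2) * x for x in xs) / len(xs)


def run_em(theta0, xs, sigma=1.0, n_iter=50):
    """Iterate EM from a pilot estimate theta0 and return the whole trajectory."""
    path = [theta0]
    for _ in range(n_iter):
        path.append(em_step(path[-1], xs, sigma))
    return path


gen = random.Random(0)
theta_star, n = 2.0, 2000                      # assumed true mean and sample size
xs = [gen.choice((-1.0, 1.0)) * theta_star + gen.gauss(0.0, 1.0) for _ in range(n)]
path = run_em(theta0=0.5, xs=xs)
print(f"initial error {abs(path[0] - theta_star):.3f}, "
      f"final error {abs(path[-1] - theta_star):.3f}")
```

The trajectory `path` makes the two regimes visible: a fast approach toward the true parameter from the pilot value, followed by stabilization at a distance governed by the sample size.
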


5 Sparse modeling

Sparse  models  are  necessary  for  model  interpretability  and  computational  efficiency  in  prediction.  For 
example,  expressing  signals  as  sparse  linear  combinations  of  a  dictionary  basis  has  enjoyed  great  success  in 
applications  ranging  from  image  denoising  to  audio  compression.  For  certain  data  types  such  as  natural  image 
patches, predefined dictionaries such as wavelets are usually available. However, when a less-known data
type is encountered, a new dictionary has to be designed for effective representations. Dictionary learning,
or sparse coding, adaptively learns a dictionary from a set of training signals such that each signal has a sparse
representation under this dictionary. In paper [4], we study the theoretical properties of learning a dictionary
from a set of N signals via $\ell_1$-minimization. We establish a sufficient and almost necessary condition for
the reference dictionary to be locally identifiable, i.e., a local minimum of the expected $\ell_1$-norm objective
function. With collaborators including students and postdocs, Yu has published five papers [10, 17, 18, 19, 11]
on other topics of sparse modeling. The results provide insights on the application of sparse classification to the
problem of topic-specific summarization, when and why the Lasso works under Poisson-like heteroscedasticity,
the complexity of the Lasso solution path, minimax-optimal rates for sparse additive models over kernel classes,
and an optimal data-dependent stopping rule for gradient descent in non-parametric regression.


6  Stability  and  inference 

When data are perturbed (e.g., by subsampling), instability of results is common for big data, which are often
high-dimensional. This instability suggests a connection with the Robust Statistics of Tukey and Huber. To bring
stability, and hence interpretability and reproducibility, to Lasso results in high dimensions, [9] proposes an
estimation stability (ES) metric to combine with the popular cross-validation (CV) for the Lasso
(or $\ell_1$-penalized Least Squares), a dominant sparse modeling method that has been effective in our neuroscience work.
For an image-fMRI data set from the Gallant Lab, we in fact improve interpretability substantially without
losing prediction performance, relative to CV. [13] is an invited paper for a special issue of Bernoulli. It
advocates  for  an  enhanced  emphasis  on  stability  as  a  means  to  work  towards  reproducibility  and  promotes 
stability  as  a  general  statistical  principle. 
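
One way to picture an estimation stability metric is to fit the Lasso on several perturbed versions of the data (here, leaving out one fold at a time) and measure how much the fitted values vary relative to their average. The sketch below does this with a small coordinate-descent Lasso. It is an assumed form of the metric for illustration only: the exact definition in [9] and the rule for combining it with CV are not reproduced, and the function names, fold scheme, penalty grid, and simulated data are hypothetical.

```python
import numpy as np


def lasso_cd(x, y, lam, n_iter=50):
    """Coordinate descent for (1/(2n)) * ||y - x b||^2 + lam * ||b||_1."""
    n, p = x.shape
    beta = np.zeros(p)
    resid = np.array(y, dtype=float)           # residual y - x @ beta (beta starts at 0)
    col_sq = (x ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += x[:, j] * beta[j]          # remove coordinate j from the current fit
            rho = x[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= x[:, j] * beta[j]
    return beta


def estimation_stability(x, y, lam, n_folds=5):
    """Variability of fitted values across leave-one-fold-out Lasso fits, scaled by their mean."""
    fold = np.arange(x.shape[0]) % n_folds
    fits = np.array([x @ lasso_cd(x[fold != v], y[fold != v], lam) for v in range(n_folds)])
    mean_fit = fits.mean(axis=0)
    denom = float(mean_fit @ mean_fit)
    if denom == 0.0:
        return float("inf")                    # all fits are zero, so the ratio is undefined
    return float(((fits - mean_fit) ** 2).sum(axis=1).mean() / denom)


rng = np.random.default_rng(0)
n, p = 100, 20
x = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]               # assumed sparse signal
y = x @ beta_true + rng.standard_normal(n)
lams = [0.05, 0.1, 0.2, 0.4, 0.8]
es = {lam: estimation_stability(x, y, lam) for lam in lams}
lam_es = min(es, key=es.get)
print(es, "most stable penalty on this grid:", lam_es)
```

Small values of the metric mean that the fitted values barely change when part of the data is withheld, which is the sense of stability that the paragraph above ties to interpretability and reproducibility.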

Inference and the construction of confidence intervals for parameter estimates play an important role in the
interpretability of the resulting findings. However, in the high-dimensional setting, inference is challenging because the limiting
distribution of estimators such as the Lasso is complicated and hard to compute, which remains a barrier to
widespread adoption of high-dimensional methodology in the sciences. In paper [12], we propose a valid
inference procedure based on the residual bootstrap after the two-stage estimator Lasso+mLS (using the Lasso to select
a model and then refitting the coefficients in the selected model with a modified version of Least Squares (mLS))
and show consistency under suitable conditions. Compared with existing methods, such as debiasing,
our method provides comparable results in terms of coverage probability and interval length, but it
is based on standard tools, the bootstrap and the Lasso, which are simple to implement and can be easily
extended to models beyond linear regression.


7 Statistical analysis of algorithmic leveraging

With rapid advances in information technology, massive datasets are collected in all fields of science,
engineering, social science, business, and government. Useful or meaningful information is often extracted from these
data through statistical means or model fitting, typically through regression models. These models
are useful for predicting a response variable from p predictor variables or for describing relationships between
predictor variables and a response variable. Given a set of n data units, in modern massive data sets, p
and/or n can be large, in which case conventional algorithms face computational challenges. Subsampling
of rows and/or columns of a data matrix has traditionally been employed as a heuristic to reduce the size of
large data sets. Recently, an innovative and effective sampling scheme based on using the empirical statistical
leverage scores as a nonuniform importance sampling distribution has been proposed. The ordinary least squares (OLS)
estimator based on such a subsample has been shown to give a good approximation to the OLS estimator based on the full data (when
p is small and n is large), both in worst-case theory and in high-quality numerical implementation. The
statistical properties of these algorithms are as yet unexplored and are of interest for both fundamental
and very practical reasons; it is these properties that this project will address. One important question
to be answered for using leverage subsampling for statistical estimation is: under what conditions on p,
n, and the underlying model does the resulting leverage-OLS have good statistical properties, such as a good
mean-squared error (MSE), when compared to the full-data OLS and other estimators, in either linear regression
or non-linear regression models? Because of the noise properties in real data, it is challenging to answer this
question.

In this project, we provide the first interpretation of the algorithmic leveraging paradigm from a statistical
analysis  point  of  view.  By  performing  a  Taylor  series  analysis  around  the  ordinary  least-squares  solution  to 
approximate  the  subsampling  estimators  as  linear  combinations  of  random  sampling  matrices,  we  provide  in 
paper  [5]  a  simple  yet  effective  framework  to  evaluate  the  statistical  properties  of  algorithmic  leveraging  in  the 
context  of  estimating  parameters  in  a  linear  regression  model  with  a  fixed  number  of  predictors.  In  particular, 
for  several  versions  of  leverage-based  sampling,  we  derive  results  for  the  bias  and  variance,  both  conditional 
and  unconditional  on  the  observed  data.  We  show  that  from  the  statistical  perspective  of  bias  and  variance, 
neither  leverage-based  sampling  nor  uniform  sampling  dominates  the  other,  which  is  particularly  striking, 
given  the  well-known  result  that,  from  the  algorithmic  perspective  of  worst-case  analysis,  leverage-based 
sampling  provides  uniformly  superior  worst-case  algorithmic  results,  when  compared  with  uniform  sampling. 
Based  on  these  theoretical  results,  we  propose  and  analyze  two  new  leveraging  algorithms:  one  constructs 
a  smaller  least  squares  problem  with  “shrinkage”  leverage  scores  (SLEV),  and  the  other  solves  a  smaller 
and  unweighted  (or  biased)  least  squares  problem  (LEVUNW).  A  detailed  empirical  evaluation  of  existing 
leverage-based  methods  as  well  as  these  two  new  methods  is  carried  out  on  both  synthetic  and  real  data  sets. 
The  empirical  results  indicate  that  our  theory  is  a  good  predictor  of  practical  performance  of  existing  and 
new  leverage-based  algorithms  and  that  the  new  algorithms  achieve  improved  performance.  For  example, 
with  the  same  computation  reduction  as  in  the  original  algorithmic  leveraging  approach,  our  proposed  SLEV 
typically  leads  to  improved  biases  and  variances  both  unconditionally  and  conditionally  (on  the  observed 
data),  and  our  proposed  LEVUNW  typically  yields  improved  unconditional  biases  and  variances. 
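
The sampling schemes compared above can be sketched in a few lines: compute leverage scores, draw a subsample with uniform, leverage-based, or shrunken leverage probabilities, and solve the smaller least-squares problem with or without importance weights. This is a minimal sketch under stated assumptions, not the implementation evaluated in [5]: the function names, the Gaussian design, the subsample size, and the mixing weight `alpha = 0.9` for the shrinkage scheme are all choices made for the example.

```python
import numpy as np


def leverage_scores(x):
    """Diagonal of the hat matrix x (x^T x)^{-1} x^T, via a thin QR factorization."""
    q, _ = np.linalg.qr(x)
    return (q ** 2).sum(axis=1)


def subsample_ls(x, y, probs, r, rng, weighted=True):
    """Draw r rows with replacement according to probs and solve the smaller problem."""
    idx = rng.choice(x.shape[0], size=r, replace=True, p=probs)
    xs, ys = x[idx], y[idx]
    if weighted:
        w = 1.0 / np.sqrt(r * probs[idx])       # importance-sampling rescaling of each row
        xs, ys = xs * w[:, None], ys * w
    return np.linalg.lstsq(xs, ys, rcond=None)[0]


rng = np.random.default_rng(0)
n, p, r = 2000, 3, 400
x = rng.standard_normal((n, p))
y = x @ np.array([1.0, 2.0, 3.0]) + rng.standard_normal(n)

lev = leverage_scores(x)
unif_probs = np.full(n, 1.0 / n)
lev_probs = lev / lev.sum()
alpha = 0.9                                      # assumed mixing weight for shrinkage
slev_probs = alpha * lev_probs + (1 - alpha) * unif_probs

beta_full = np.linalg.lstsq(x, y, rcond=None)[0]
estimates = {
    "UNIF": subsample_ls(x, y, unif_probs, r, rng),
    "LEV": subsample_ls(x, y, lev_probs, r, rng),
    "SLEV": subsample_ls(x, y, slev_probs, r, rng),
    "LEVUNW": subsample_ls(x, y, lev_probs, r, rng, weighted=False),
}
for name, beta in estimates.items():
    print(name, np.round(beta - beta_full, 3))
```

Repeating the draw many times and averaging the squared differences from `beta_full` would give an empirical view of the bias and variance comparison; with a Gaussian design the leverage scores are nearly uniform, so heavier-tailed designs are where the schemes are expected to differ most.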